Abstract


Activation functions introduce non-linearity into deep neural networks. This non-linearity helps neural networks learn faster and more efficiently from the dataset. In deep learning, many activation functions have been developed and are chosen according to the type of problem. ReLU's variants, SWISH, and MISH are go-to activation functions. The MISH function is considered to have similar or even better performance than SWISH, and much better performance than ReLU. In this paper, we propose an activation function named APTx which behaves similarly to MISH but requires fewer mathematical operations to compute. The lower computational requirements of APTx speed up model training, and thus also reduce

Volume 2, Issue 2, July 2022
Received: March 2022
Accepted: 21 June 2022
Published: 05 July 2022
1. Introduction


The ability of deep learning models to learn features directly from the data has made them a default approach to solving many complex problems. A simple artificial neuron is linear in nature, as expressed in Equation (1).


$y = w_i x_i + b$ ...(1)
where,
$y$ is the output from the neuron;
$x_i$ is the input to the neuron;
$w_i$ is the associated weight; and
$b$ is the associated bias.


When the output of this neuron is passed to an activation function, non-linearity is introduced into the network. One important consideration when choosing an activation function is that its derivative should not be constant over its domain. Generally, an activation function f is applied to the output of the neurons in the hidden layers so that the neural network can learn complex features, as expressed in Equation (2).
$\text{Output} = f(y)$ ...(2)
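Equations (1) and (2) can be illustrated with a minimal sketch. It assumes that the products $w_i x_i$ are summed over all inputs, which Equation (1) does not state explicitly, and it uses tanh purely as a placeholder for f. The function names `neuron_output` and `activate` are hypothetical and chosen only for this illustration.

```python
import math

def neuron_output(weights, inputs, bias):
    """Linear pre-activation y = sum_i(w_i * x_i) + b (Equation (1), summed over i by assumption)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def activate(y, f=math.tanh):
    """Output = f(y) as in Equation (2); tanh is only an illustrative choice of f."""
    return f(y)

# Two inputs with hypothetical weights and bias.
y = neuron_output([0.5, -1.0], [2.0, 1.0], 0.25)
out = activate(y)
print(y, out)
```

Without the call to `activate`, stacking several such neurons would still yield a linear function of the inputs, which is why the non-linear step matters.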


The SWISH activation function is considered better than the ReLU function and its variants. However, the more recently developed MISH activation function is considered equivalent to, or in some cases even better than, the SWISH activation function.


In this paper, we propose an activation function, APTx, which behaves similarly to the MISH activation function but requires fewer mathematical operations. Less computation is therefore required in APTx to calculate the output in forward propagation, which reduces the hardware requirements of the training and inference phases. The derivative of APTx also has fewer operations than that of MISH, so neural networks train faster with APTx than with the MISH activation function.


2. Related Works 


Vinod Nair and Geoffrey Hinton (2010) studied the effect of Rectified Linear Units (ReLU) on Restricted Boltzmann Machines. Agarap (2022) used ReLU with convolutional neural networks on the MNIST dataset, which outperformed the CNN with softmax on the classification task. Glorot et al. (2011) and Sun et al. (2014) discussed the sparsity of ReLU as a reason for its better performance. Szandala (2021) performed a comparative analysis showing that the tanh and sigmoid functions both suffer from the vanishing gradient problem, which ReLU overcomes, and showing the dying-ReLU problem for negative values. Mass et al. (2013) presented an improved version of ReLU called Leaky-ReLU, in which the function outputs a small negative number for negative inputs instead of zero. Clevert et al. (2015) proposed the ELU function, which was faster and better than both ReLU and Leaky-ReLU. Ramachandran et al. (2017) presented the SWISH activation function, which performs better than ReLU and its variants. Misra (2019) proposed the MISH activation function, which has similar, and in some cases even better, performance than the SWISH activation function.


3. Proposed APTx Activation Function 


We propose an activation function named "Alpha Plus Tanh Times", or APTx for short. Our APTx function is presented as $\phi$ in Equation (3), and its derivative is shown in Equation (4).


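$\phi(x) = (\alpha + \tanh(\beta x)) \cdot \gamma x$ ...(3)

$\phi'(x) = \gamma \left(\alpha + \tanh(\beta x) + \beta x \, \mathrm{sech}^2(\beta x)\right)$ ...(4)

By adjusting the values of the parameters $\alpha$, $\beta$ and $\gamma$, we can make the function $\phi$ behave like the MISH activation function. The resulting function $\phi$ and its derivative are shown in Equations (5) and (6), where $\alpha = 1$, $\beta = 1$ and $\gamma = 1/2$.

$\phi(x) = (1 + \tanh(x)) \cdot x / 2$ ...(5)

$\phi'(x) = \left(1 + \tanh(x) + x \, \mathrm{sech}^2(x)\right) / 2$ ...(6)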
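The APTx function and its derivative can be sketched in a few lines. This is a scalar, standard-library illustration rather than a framework implementation; the names `aptx`, `aptx_grad` and `central_difference` are hypothetical. The derivative uses the identity sech²(u) = 1 − tanh²(u), so the tanh value is computed once and reused, and the closed form is checked against a numerical derivative.

```python
import math

def aptx(x, alpha=1.0, beta=1.0, gamma=0.5):
    """APTx, Equation (3): (alpha + tanh(beta * x)) * gamma * x."""
    return (alpha + math.tanh(beta * x)) * gamma * x

def aptx_grad(x, alpha=1.0, beta=1.0, gamma=0.5):
    """Derivative of APTx with respect to x, Equation (4)."""
    t = math.tanh(beta * x)
    sech2 = 1.0 - t * t  # sech^2(u) = 1 - tanh^2(u)
    return gamma * (alpha + t + beta * x * sech2)

def central_difference(f, x, h=1e-5):
    """Numerical derivative, used only to sanity-check the closed form."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# The closed-form derivative should agree with the numerical one.
for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    assert abs(aptx_grad(x) - central_difference(aptx, x)) < 1e-6
```

With the default parameters the functions correspond to Equations (5) and (6); other values of alpha, beta and gamma can be passed as keyword arguments.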

For a detailed visual analysis of the behavior of APTx, its graph is shown in Figure 1, and the graph of its derivative is shown in Figure 2.


Figure 1: Graph of our Proposed APTx Activation Function at α = 1, β = 1 and γ = ½


Figure 2: Graph of the Derivative of APTx Activation Function at α = 1, β = 1 and γ = ½


Although activation functions are chosen according to the type of problem, some popular activation functions have already been compared in existing research. First, we discuss how the MISH activation function is better than the SWISH, ELU, Leaky-ReLU, ReLU, tanh and sigmoid activation functions in general scenarios. Afterwards, we compare the MISH activation function with our proposed APTx function.


4. Comparative Analysis of Existing Activation Functions


The sigmoid activation function is expressed mathematically in Equation (7), and a comparison of its derivative with the derivative of tanh is shown in Figure 3. Figure 3 shows that the range of the tanh derivative is larger than that of the sigmoid derivative, but for inputs far from zero both derivatives are very small. This introduces the Vanishing Gradient Problem (Szandala, 2021) in larger neural networks.


$\text{Sigmoid}(x) = 1 / (1 + e^{-x})$ ...(7)


Figure 3: Graph Showing Derivatives of tanh and Sigmoid Activation Functions
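The shrinking derivatives described above can be checked numerically. The sketch below uses the standard closed forms sigmoid'(x) = s(1 − s) and tanh'(x) = 1 − tanh²(x); the function names are chosen only for this illustration.

```python
import math

def sigmoid(x):
    """Sigmoid, Equation (7)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh^2(x)."""
    t = math.tanh(x)
    return 1.0 - t * t

# Both derivatives peak at x = 0 and decay quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  tanh'={tanh_grad(x):.2e}  sigmoid'={sigmoid_grad(x):.2e}")
```

When many layers are chained, these small factors are multiplied together during backpropagation, which is how the gradient vanishes in deeper networks.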


The ReLU activation function provided a solution to the Vanishing Gradient Problem, at least for positive inputs (Glorot et al., 2011; Sun et al., 2014), but for negative inputs it suffers from the Dying-ReLU problem (Szandala, 2021), as its derivative for negative values is zero. Leaky-ReLU (Mass et al., 2013) was able to solve the Dying-ReLU problem to some extent. ELU (Clevert et al., 2015) showed better performance than Leaky-ReLU in most tasks, as it tends to drive the cost toward zero faster and produce accurate results. For positive inputs, ReLU, Leaky-ReLU, and ELU all behave in the same manner; the difference lies in the non-positive values, as shown in Figure 4 and presented in Equations (8), (9), and (10) respectively.


Figure 4: Graph of ReLU, Leaky-ReLU (with α = 0.05), and ELU (with α = 2)


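$\text{ReLU}(x) = \max(0, x)$ ...(8)
$\text{Leaky-ReLU}(x) = \begin{cases} \alpha x, & x \le 0 \\ x, & x > 0 \end{cases}$ ...(9)
$\text{ELU}(x) = \begin{cases} \alpha (e^{x} - 1), & x \le 0 \\ x, & x > 0 \end{cases}$ ...(10)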
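Equations (8), (9) and (10) translate directly into code. The sketch below is a scalar illustration whose default α values are taken from Figure 4 (0.05 for Leaky-ReLU and 2 for ELU); `math.expm1(x)` computes e^x − 1 and is used only for numerical accuracy near zero.

```python
import math

def relu(x):
    """ReLU, Equation (8)."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.05):
    """Leaky-ReLU, Equation (9): a small slope alpha for non-positive inputs."""
    return x if x > 0 else alpha * x

def elu(x, alpha=2.0):
    """ELU, Equation (10): saturates toward -alpha for large negative inputs."""
    return x if x > 0 else alpha * math.expm1(x)

# Identical for positive inputs, different for non-positive ones.
for x in (-3.0, -1.0, 0.0, 1.5):
    print(x, relu(x), leaky_relu(x), elu(x))
```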


The SWISH activation function (Ramachandran et al., 2017) performs better than the ReLU activation function and its variants, because none of these variants has managed to overcome the inconsistent gains (i.e., in the calculation of derivatives). SWISH can be considered a type of self-gated function, as expressed in Equation (11).


$\text{SWISH}(x) = x \times \text{Sigmoid}(x)$ ...(11)


Although the introduction of SWISH addressed the vanishing gradient problem and provided consistent gains, the MISH activation function (Misra, 2019) turned out to provide equivalent, and in many tasks even better, performance than the SWISH activation function. Its mathematical form is presented in Equation (12).


$\text{MISH}(x) = x \times \tanh(\ln(1 + e^{x}))$ ...(12)

Graphs of the derivatives of the SWISH and MISH functions are plotted in Figure 5.


Figure 5: Graph of the Derivatives of SWISH and MISH Activation Functions
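Equations (11) and (12) can be sketched as scalar functions. The numerically stable forms of sigmoid and of ln(1 + e^x) used here are an implementation choice for this illustration and are not taken from the paper; they only avoid floating-point overflow for inputs of large magnitude.

```python
import math

def sigmoid(x):
    """Sigmoid written so that exp() is never called on a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    """ln(1 + e^x), rearranged to avoid overflow for large positive x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def swish(x):
    """SWISH, Equation (11): x * sigmoid(x)."""
    return x * sigmoid(x)

def mish(x):
    """MISH, Equation (12): x * tanh(ln(1 + e^x))."""
    return x * math.tanh(softplus(x))

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:5.1f}  swish={swish(x):.6f}  mish={mish(x):.6f}")
```

Note that MISH evaluates an exponential, a logarithm and a tanh per input, which is the cost the later comparison with APTx refers to.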
5. Comparative Analysis of MISH with Proposed APTx


During forward propagation, calculating APTx as expressed in Equation (5) requires fewer mathematical operations than calculating the MISH activation function shown in Equation (12). Like the MISH function, APTx is bounded below and unbounded above.


The biggest advantage of APTx arises during the training phase, when backpropagation is performed. Backpropagation requires the calculation of derivatives in each epoch, and APTx requires fewer mathematical operations to compute its derivative than the MISH activation function. The derivative of MISH is expressed in Equation (13), and for comparison the derivative of APTx is stated again in Equation (14), where α = 1, β = 1 and γ = ½.

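$\text{MISH}'(x) = \dfrac{e^{x}\left(4(x + 1) + 4e^{2x} + e^{3x} + e^{x}(4x + 6)\right)}{\left(2e^{x} + e^{2x} + 2\right)^{2}}$ ...(13)

$\phi'(x) = \left(1 + \tanh(x) + x \, \mathrm{sech}^2(x)\right) / 2$ ...(14)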
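The two derivatives can be compared numerically with a short sketch. It evaluates Equations (13) and (14) on a grid of inputs in [-5, 5], checks the closed form of the MISH derivative against a numerical derivative of MISH itself, and reports the largest pointwise gap between the two curves. The grid, the helper names and the finite-difference check are assumptions made for this illustration only.

```python
import math

def mish(x):
    """MISH, Equation (12), with ln(1 + e^x) written in an overflow-safe form."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)

def mish_grad(x):
    """Closed-form derivative of MISH, Equation (13)."""
    ex = math.exp(x)
    omega = 4.0 * (x + 1.0) + 4.0 * ex**2 + ex**3 + ex * (4.0 * x + 6.0)
    delta = 2.0 * ex + ex**2 + 2.0
    return ex * omega / delta**2

def aptx_grad(x):
    """Derivative of APTx at alpha = 1, beta = 1, gamma = 1/2, Equation (14)."""
    t = math.tanh(x)
    return (1.0 + t + x * (1.0 - t * t)) / 2.0

def central_difference(f, x, h=1e-5):
    """Numerical derivative, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

points = [i / 10.0 for i in range(-50, 51)]  # -5.0, -4.9, ..., 5.0

# Largest disagreement between Equation (13) and a numerical derivative of MISH.
fd_error = max(abs(mish_grad(x) - central_difference(mish, x)) for x in points)

# Largest pointwise gap between the MISH and APTx derivatives on the grid.
max_gap = max(abs(mish_grad(x) - aptx_grad(x)) for x in points)

print(f"closed-form check: {fd_error:.2e}, max |MISH' - APTx'|: {max_gap:.3f}")
```

The APTx derivative needs a single tanh evaluation, whereas Equation (13) needs an exponential together with several powers and a division.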

Interestingly, although the derivative of the APTx function requires fewer operations than the derivatives of MISH and SWISH, the derivative graphs of APTx and MISH, presented in Figure 6, show similar behavior, which is useful for backpropagation.


Figure 6: Graph of Derivatives of MISH and APTx with α = 1, β = 1 and γ = ½


Even more overlap between the MISH and APTx derivatives can be obtained by varying the values of the α, β and γ parameters.
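One way to explore this is a small grid search, sketched below under stated assumptions. It keeps α = 1 and γ = ½ fixed, because these values make the APTx derivative approach 0 for large negative inputs and 1 for large positive inputs, as the MISH derivative does, and it varies only β. The candidate grid, the input range [-5, 5] and the maximum-gap criterion are hypothetical choices for this illustration and are not taken from the paper.

```python
import math

def mish_grad(x):
    """Closed-form derivative of MISH, Equation (13)."""
    ex = math.exp(x)
    omega = 4.0 * (x + 1.0) + 4.0 * ex**2 + ex**3 + ex * (4.0 * x + 6.0)
    delta = 2.0 * ex + ex**2 + 2.0
    return ex * omega / delta**2

def aptx_grad(x, alpha, beta, gamma):
    """Derivative of APTx, Equation (4)."""
    t = math.tanh(beta * x)
    return gamma * (alpha + t + beta * x * (1.0 - t * t))

def max_gap(alpha, beta, gamma, points):
    """Largest pointwise gap between the MISH and APTx derivatives."""
    return max(abs(mish_grad(x) - aptx_grad(x, alpha, beta, gamma)) for x in points)

points = [i / 10.0 for i in range(-50, 51)]  # inputs in [-5, 5]
betas = [k / 20.0 for k in range(5, 21)]     # candidate beta values 0.25 ... 1.00

# Pick the candidate beta with the smallest maximum gap (alpha, gamma held fixed).
best_beta = min(betas, key=lambda b: max_gap(1.0, b, 0.5, points))

print("beta = 1.0 gap:", max_gap(1.0, 1.0, 0.5, points))
print("best beta:", best_beta, "gap:", max_gap(1.0, best_beta, 0.5, points))
```

A different criterion, such as the mean squared gap, or a wider search over all three parameters could lead to a different choice.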


6. Conclusion 


MISH has similar or even better performance than SWISH, which in turn is better than the rest of the activation functions. Our proposed activation function APTx behaves similarly to MISH but requires fewer mathematical operations to calculate values in forward propagation and derivatives in backward propagation. This allows APTx to train neural networks faster and to run inference on low-end computing hardware, such as the edge devices used in Internet of Things deployments.